Solubility of gaseous hydrocarbons in ionic liquids using equations of state and machine learning approaches

Ionic liquids (ILs) have emerged as suitable options for gas storage applications over the past decade. Consequently, accurate prediction of gas solubility in ILs is crucial for their application in the industry. In this study, four intelligent techniques including Extreme Learning Machine (ELM), Deep Belief Network (DBN), Multivariate Adaptive Regression Splines (MARS), and Boosting-Support Vector Regression (Boost-SVR) have been proposed to estimate the solubility of some gaseous hydrocarbons in ILs based on two distinct methods. In the first method, the thermodynamic properties of hydrocarbons and ILs were used as input parameters, while in the second method, the chemical structure of ILs and hydrocarbons along with temperature and pressure were used. The results show that in the first method, the DBN model with root mean square error (RMSE) and coefficient of determination (R2) values of 0.0054 and 0.9961, respectively, and in the second method, the DBN model with RMSE and R2 values of 0.0065 and 0.9943, respectively, have the most accurate predictions. To evaluate the performance of intelligent models, the obtained results were compared with previous studies and equations of the state including Peng–Robinson (PR), Soave–Redlich–Kwong (SRK), Redlich–Kwong (RK), and Zudkevitch–Joffe (ZJ). Findings show that intelligent models have high accuracy compared to equations of state. Finally, the investigation of the effect of different factors such as alkyl chain length, type of anion and cation, pressure, temperature, and type of hydrocarbon on the solubility of gaseous hydrocarbons in ILs shows that pressure and temperature have a direct and inverse effect on increasing the solubility of gaseous hydrocarbons in ILs, respectively. Also, the evaluation of the effect of hydrocarbon type shows that increasing the molecular weight of hydrocarbons increases the solubility of gaseous hydrocarbons in ILs.

www.nature.com/scientificreports/ (SVM) were used to develop the group contribution (GC) method. The results showed that the ANN-GC model had the best performance with MSE and R 2 values of 0.0202 and 0.9836, respectively. In 2021, Moosanezhad-Kermani et al. 34 developed a group method of data handling method for prediction CO 2 solubility in ILs. In 2022, Mohammadi et al. 35 predicted SO 2 solubility in ILs using four intelligent models and equations of state. In this study, using an extensive data bank including 2145 experimental data points, four intelligent models including Extreme Learning Machine (ELM), Deep Belief Network (DBN), Multivariate Adaptive Regression Splines (MARS), and Boosting-Support Vector Regression (B-SVR) are developed to predict the solubility of gaseous hydrocarbons (CH 4 , C 2 H 6 , C 3 H 8 , C 4 H 10 , C 2 H 4 , C 3 H 6 , C 4 H 8 , and C 6 H 6 ) in ILs in a wide range of temperature and pressure. In this regard, intelligent models are trained based on two strategies, including Model (I): based on the thermodynamic properties of ILs and hydrocarbons, and Model (II): based on the chemical structure of ILs and hydrocarbons. In the first method, the critical properties of ILs and hydrocarbons and their molecular weights are considered as input parameters to the model. In the second method, the effect of the substructures that make up ILs and gaseous hydrocarbons are considered. In the development of existing models, the data is divided into an 80/20 ratio for the training and testing stages. Results are compared with previous studies and equations of state to evaluate the precision of the developed model. Furthermore, the solubility trend study is also performed to investigate the effect of different parameters on the solubility of gaseous hydrocarbons in ILs.
Data collection. In this study, 2145 experimental data for the solubility of gaseous hydrocarbons (CH 4 , C 2 H 6 , C 3 H 8 , C 4 H 10 , C 2 H 4 , C 3 H 6 , C 4 H 8 , and C 6 H 6 ) in ionic liquids over a wide range of temperature (280-453.15 K) and pressure (0-201.64 bar) have been collected from the literature 3,4,12,13,[36][37][38][39][40][41][42][43][44][45][46][47][48][49][50][51][52][53][54] . The details of the provided database are summarized in Table 1. In the method based on thermodynamic properties, temperature, pressure, molecular weight, boiling point, and critical properties of ILs and hydrocarbons are considered as input parameters, while in the method based on chemical structure, in addition to the substructures of ILs and hydrocarbons, temperature and pressure are considered as input parameters. The statistical description of the input parameters and a number of substructures used in this study are reported in Table 2 and Table 3, respectively. The data are then divided into an 80/20 ratio in the training and testing stages, respectively. Figure 1 illustrates the chemical structure of anions and cations used in this study.

Modeling
Multivariate adaptive regression spline (MARS). The multivariate adaptive regression spline (MARS), which is comprised of a count of different basis functions, will be used for regressive assessment and the identification of estimation techniques. The MARS model can calculate the enhanced structure of the multiplication of spline basis functions (BFs). The MARS system may choose the count of BFs and the properties related to each one, like the multiplication degree and node positions, autonomously. We might mention recurrent regression as one of the various clustering techniques. To properly describe higher-order responses, the MARS model employs the backward stepwise approach 55,56 . The MARS concept has several strengths. These benefits provide several items that will be detailed further below. This model's practical property is that the count of parameters in this mathematical equation is reduced, which speeds up the calculation of the body while reducing simulation precision. Conserving computing resources is extremely desired for researchers. This model's clever method of dividing factors into two groups of crucial factors and supplementary factors, which maximize detection accuracy, is another noteworthy aspect. The mathematical look of this concept is one of its most appealing qualities 57 . The dependent and independent components are identified by the mathematical expression form 58 . There is no supposition for linking inputs and output among the premises utilized in the MARS system. The purpose of this analysis is to look into the relationship between the output y and inputs {x 1 ,…, x n } based on previous revelations. The following is a description of the data generation system 59 : e refers to the relative error that matches across the intended region (x 1 ,…, x n ) in which the MARS model is generated. A progressive linear regression technique is a frequent model-building approach that MARS employs. MARS employs an automated way to replace major inputs. A succession of comparable splines and their derivatives is created to provide these possibilities. The splines are coupled and form polynomial curves to construct an ideal model. BFs are the name for these polynomials. They can simulate both linear and nonlinear processes. BFs are often split into couples, each piecewise linear or cubic in its own right. They have nodes at a certain place, each of which is related to the inputs. Below equation gives the shape of distributed linear BFs as 59 : The positive component is denoted by "+", while the node is denoted by "t". The MARS model is calculated using the following formula in its standard form 59 : In the above equation, X is the input vector, and B m , β m , and β 0 are mth BF, coefficient of mth BF, and the constant of the equation, respectively.
For the MARS approach, there are two stages to follow 59 : www.nature.com/scientificreports/ Stage 1: In the targeted area, a collection of BFs is generated. This process begins with a structure that just has the intercept term and adds the phrases one at a time. BFs should be used for the phrases. The forward step can be stopped in two ways: the first is when the maximum number of BFs is reached, and the second is when the regression coefficient of modifications is less than the set value. Several BFs are consecutively utilized, resulting in an overfitting system.

Stage 2:
Overfitting is undesired and must be avoided at all costs. Because the system uses a monitoring phrase, it must be condensed by deleting at least one more relevant BF at a time. Then, from among the best systems of any size, the system with the least general cross-validation (GCV) is chosen to complete the system. This procedure concludes with the creation of a final model. GCV is used in the MARS system to eliminate BFs that aren't needed. A GCV expression is as follows 60,61 : MSE train represents the mean square error when utilizing training datasets, N represents the size of observations, and enp indicates the number of factors in the practical circumstance. Figure 2 illustrates the schematic of the MARS model.
Extreme learning machine (ELM). Huang G.B. invented the extreme learning machine (ELM) to minimize time-consuming repeated training and improve segmentation accuracy 62,63 . The ELM architecture consists of an input layer, a concealed layer, and an output layer. It is a single-hidden-layer feedforward neural network (SLFN). Unlike conventional learning methods (like the Backpropagation), which arbitrarily established all system training variables and quickly build local best solutions, the ELM just adjusts the neurons in the invisible layer, randomizes the weights between the visible and invisible layers, and also the bias of the hidden neurons in the method implementation stage, determines the hidden layer output matrix, and subsequently finds the weights. The ELM provides the prospect of quick learning speed because of its straightforward network archi-  Table 3. Different substructures of ionic liquids and hydrocarbons. www.nature.com/scientificreports/ tecture and succinct variable calculation operations. Figure 3 depicts the architectural framework of ELM. Figure 3 shows the construction of an extreme learning machine system, which has n input layer neurons, l hidden layer neurons, and m output layer neurons. Given the training set {X, Y } = x i , y i (i = 1, . . . , Q) , with an input parameter X = x i1 x i2 . . . x iQ and an output matrix Y = y i1 y i2 . . . y iQ composed of the training set, the matrix X and the matrix Y may be represented as follows 64 : www.nature.com/scientificreports/ here, n and m are the dimensions of the input and output matrices, respectively. The ELM then adjusts the weights between the input and hidden layers at random 64 :

Substructures
The weights between the jth input layer neuron and the ith hidden layer neuron are represented by w ij . Third, the ELM makes the following assumptions about the weights between the hidden and output layers 64 : The weights between the jth hidden layer neuron and the kth output layer neuron are represented by β jk . Fourth, the ELM sets the buried layer neurons' bias at random 64 : www.nature.com/scientificreports/ Fifth, the ELM selects the g(x) system activation function.
The output matrix T may be represented as follows according to Fig. 3: The output matrix T's column vectors are as follows 64 : Sixth, have a look at Eqs. (9) and (10) to see what we can come up with: here, T ′ is the transpose of T and H is the concealed layer's output. We utilize the least square approach to determine the weight matrix values of β 62 to find a particular solution with the least amount of error.
We apply a regularization element to the β 65 to increase the network's generalization capabilities and make the findings more robust. When the count of concealed layer neurons is smaller than the number of training sets, β can be calculated as: When the number of concealed layer nodes exceeds the number of training data, β is 64 : 66 have written a detailed description of SVR methodology. As a result, this study mainly provides a basic overview of SVR. SVR fits the linear relationship y = x T p + b regarding the designation error reduction theory, in which the regression parameters p and b are determined by minimizing the given cost function R 67 : Therefore, by employing the optimization criteria and inserting Lagrange multipliers α * j , α j linear relationship y = x T p + b has the given explicit expression 67 :

Support vector regression (SVR). Smola and Scholkopf
By replacing a kernel function K x j , x for the dot product x j , x , this linear approach may be extended to nonlinear regression in a direct way. As the kernel function, a variety of functions can be employed. The Gaussian radial basis function K x j , x = exp −�x j − x�/ 2γ 2 is a regularly utilized kernel function 67 . Figure 4 shows the flowchart of the SVR approach.

Boosting support vector regression (B-SVR).
For the regression problem, many implementations of boosting were utilized, including adaptive boosting for regression 68 and Stochastic gradient boosting 69 . The essential objective of B-SVR is to train a series of SVR systems, each of which is built often using data with high mistakes gained from the prior SVR designs. The following is a description of the technique 67 .
To begin, all of the data in the initial training dataset are given identical weights w  www.nature.com/scientificreports/ J samples (sometimes with repetition, i.e., some values that occur several times in the boosting collection) are chosen from the main training samples according to the selection probability functions f(t).
(Second stage): Utilizing boosting dataset, create an SVR network and predict the outputs of the main training data.
(Third stage): Using the given loss function, calculate a loss value for each item in the initial training dataset 67 : (Fourth stage): Determine the mean loss: and modify the weight of each component in the previous training dataset utilizing 67 : The notation (1 − L j ) indicates that β t has been increased to the (1 − L j )th power. This weight-updating technique indicates that if the SVR system generated in the second stage fails to estimate data, its weight will be raised. β t is a measure of the developed SVR model's reliability. A high β t suggests that the SVR system has a poor level of certainty. Because the quantity of w (t+1) j is required to calculate f (t+1) j in the first stage of the next computation phase (t + 1), Eq. (22) yields a transition to the next calculation cycle (t + 1). T is chosen by the training collection's root mean squared error (RMSE) 70 , and boosting cycle is performed by Drucker 71 when the average loss is less than 0.5 67 .
Finally, in terms of estimation, each of the T SVR systems produces a forecasting y (t) j for the jth unknown feature and a matching β t . These T forecasts are merged to provide a target variable utilizing the weighted median method, which is calculated as follows: Arrange these T forecasts, y j . From the lowest term n 1 to the rth term n r , add the sum of log( 1 β nj ) through j until the following inequality is satisfied 67 : The overall estimate for the jth data is then extracted from the forecast from the n r th SVR system 71 . Utilizing the same technique as before, group estimates of outputs for additional inputs may also be produced 67 .

Deep belief network (DBN). Artificial Neural Networks (ANNs) have been initially suggested in the
1960s and now have grown in prominence in a variety of machine learning approaches, including classification, clustering, regression, and forecasting 72 . ANNs are flexible analytical approaches that can build complicated associations between real-world information. Among the several types of ANNs, the Back-Propagation (BP) mechanism is one of the most common structures. Although, it should be highlighted that using casual weights at the beginning of the training process is one of the difficulties with BP. As a consequence of this challenge, a unique technique, Deep Belief Network (DBN), for deep network pre-training was devised in 2006 73 , leading to significant advances in machine learning.
A DBN is comprised of several Restricted Boltzmann Machines (RBMs), which are unattended creative prediction model that uses one hidden layer to mimic the presumed distribution of measurable variables. Our DBN uses a series of RBMs to identify relevant architectures, which extract high-level properties from original data. Figure 5 shows a DBN in graphical form 74 .
DBN training is divided into three phases 75 : 1. Conduct unsupervised and hungry layer-wise pre-training utilizing a collection of RBMs. 2. After haphazardly setting the linking weights matrix between the last hidden layer and the output neuron, calculate the error (first fine-tuning stage). www.nature.com/scientificreports/ 3. As a second fine-tuning stage, use error Back Propagation. The RBM training technique is described as follows: Restricted Boltzmann machine. The encoder/decoder technique used by the Restricted Boltzmann Machine turns inputs into an increased attributes statement (Fig. 6). The decoder may then retrieve the inputs. RBM growth through the reproduction of inputs is a major feature of DBN because this strategy is unregulated and does not need marked data 76    here, Z is the normalizing factor and denotes: In this work, we have used intelligent models based on two different methods. The reason for choosing the thermodynamic properties-based method is to make a model whose inputs are the same as the equations of state so that it is simple and a reasonable comparison can be made. The reason for choosing the method based on chemical structure is to make a model that is more general and can be used for different ionic liquids and operating conditions.

Equations of state (EOSs)
Equations of state (EOSs) are utilized to determine the thermodynamic characteristics of pure substances and mixtures, particularly for processes at pressures higher than 10 bar, since in these circumstances at least one constituent exists in supercritical phase in most cases. The cubic EOSs are the most common types of EOS. The van der Waals EOS, proposed in 1873 78 as the first model to represent features of both liquid and vapor phases, is the source of these approaches. Numerous EOS have been built on top of the van der Waals EOS in recent decades. For instance, there are over 220 versions and numerous researches that are relevant to parameterization and application to compounds for the widely used Peng-Robinson EOS. Cubic EOS has become quite widespread in the oil and gas industry because they are straightforwardly formulated mathematically and provide good thermodynamic property relationships and forecasts for compounds of different fluids 79 . In this study, to evaluate the performance of intelligent techniques, the obtained results are compared with equations of state. In this regard, four equations of state including Peng-Robinson (PR), Soave-Redlich-Kwong (SRK), Redlich-Kwong (RK), and Zudkevitch-Joffe (ZJ) have been used. The mathematical relations of the equations mentioned are summarized in Table 4.

Results and discussion
Statistical analysis. In this study, the solubility of gaseous hydrocarbons in ILs was predicted using 2145 laboratory data with the four intelligent models already mentioned. For this purpose, two different methods based on thermodynamic properties and the chemical structure of ILs and hydrocarbons have been used. The process of model development and estimation of the solubility values is illustrated in Fig. 7. Statistical criteria Pc ZJ: Zudkevitch-Joffe www.nature.com/scientificreports/ comparing the performance of intelligent models are root mean squared error (RMSE) and coefficient of determination (R 2 ), the mathematical relationships of which are defined as follows 24,80 : where HS iexp , HS ipred , HS iexp , and n refer to experimental hydrocarbon solubility, predicted hydrocarbon solubility, mean value of experimental hydrocarbon solubility, and total number of data points, respectively. The results of statistical analysis of the developed models are reported in Table 5. According to the obtained results, the DBN model has the best performance in both methods. In the method based on thermodynamic properties, the DBN model has the highest accuracy with R 2 and RMSE values of 0.9961 and 0.0054, respectively. The   Figure 8 shows a cross-plot diagram for models developed based on thermodynamic properties. As can be seen from Fig. 8, the DBN model has higher density points around the X = Y line, which indicates the higher accuracy of this model than other models, although other models are also in an acceptable situation. Figure 9 shows the cross-plot diagram of the results of the method based on chemical structure, temperature, and pressure. The graphical analysis presented in Fig. 9 also confirms the results of Table 5, and the DBN model has a more appropriate cross plot diagram than the other models. Figures 10 and 11 illustrate the distribution of the deviation of the predicted values from the actual values for Model (I) and Model (II), respectively. The higher the focus around the zero line, the closer the predicted values are to the actual values and the more accurate the model. Therefore, as can be seen from Figs. 10 and 11, the DBN model is in a better position than the other smart models in both methods.
The Taylor diagram 81 was developed in Fig. 12, which integrates a number of statistical variables for a more understandable representation. This illustration depicts all of the smart approaches in respect of how well they forecast gaseous hydrocarbon solubility in various ILs. Three performance parameters [standard deviation (SD), RMSE, and R 2 ] of each network, including DBN, ELM, B-SVR, and MARS, are utilized to quantify the degree of discrepancy between the systems' estimates and the related actual values. The reference point is a distance from the full measure that is used to calculate the centered RMSE. Another reference is the optimal estimation technique, which is denoted by a point with R 2 equal to 1. Figure 12a,b show the Taylor diagram of the DBN model for the results of Model (I) and Model (II), respectively. In Fig. 12a, the RMSE and R 2 values are 0.0054 and 0.9961, respectively. In Fig. 12b, the RMSE and R 2 values are 0.0065 and 0.9943, respectively. Figure 12 depicts the performance parameters of the systems after they have been encapsulated within the Taylor diagram. The Taylor diagram in this picture verifies DBN's dominance by demonstrating that their worldwide performance assessment is the most accurate.
Equations of state evaluation. Equations of state are one of the conventional methods for calculating the solubility of gases in ILs. Comparing the results of smart models with equations of state can reveal the efficiency and performance of smart models well. For this purpose, four equations of state including Peng-Robinson (PR), Soave-Redlich-Kwong (SRK), Redlich-Kwong (RK), and Zudkevitch-Joffe (ZJ) have been used to calculate the solubility of gaseous hydrocarbons in ILs. Figure 13 shows the results obtained for two systems ([BMIM][PF 6 ]-CH 4 and [BMIM][PF 6 ]-C 2 H 6 ) at 323.15 K. As can be seen from Fig. 13, with increasing pressure, the solubility of gaseous hydrocarbons increases, a trend that can also be seen in the equations of state. Also, the solubility of C 2 H 6 in [BMIIM][PF 6 ] is higher compared to CH 4 . Another noteworthy point is the very good agreement of the results of the smart models with the experimental data. It can also be found that the equations of state overestimate the solubility and the difference between the calculated values and the experimental data is also significant.
Hamedi et al. 5 calculated the solubility of methane in ILs using two equations of state including modified Sanchez and Lacombe (SL) and cubic-plus-association (CPA). In this study, 649 experimental data related to 19 ionic liquid and methane systems were used. Figure 14 shows a comparison of current study models and Hamedi et al. results. As can be seen, the accuracy of the intelligent models compared to the equations of state used by Hamedi et al. 5   www.nature.com/scientificreports/ process, but in simpler terms, changing solubility factors (such as Henry's constant) due to changes in physical and chemical properties cause these trends 82,83 . Enthalpy, entropy, and interactions between gas and ionic liquid are the most important factors affecting the solubility of gases in ILs. Therefore, the effect of changes in pressure, temperature, type of anion and cation on solubility is affected by changes in enthalpy and entropy. In other words, the endothermic or exothermic nature of the dissolution process, which is also depends on the interactions between the gas and the ionic liquid, affects how the gas solubility changes with the increase of each of the mentioned parameters 84 .
Pressure and anion type effect. Previous studies have shown that although ILs have similar solubility at low pressures 85 , increasing pressure leads to increasing the solubility of gases in ILs and the rate of increase varies for different ILs 86 . Figure 15 shows the changes in the solubility of gaseous hydrocarbons with increasing pressure for a number of systems used in this study. As can be seen, increasing the pressure increases the solubility of gaseous hydrocarbons in ILs. The effect of the type of anions existing in the structure of ILs has been studied in many studies 86 . As mentioned in the literature, the main reasons for changes in the solubility due to changes in pressure and anion type are enthalpy and entropy changes, and the interactions between gas and ILs 84 . The effect of a number of anions has been investigated for example. Investigation of the systems reported in Fig. 15 shows that the effect of anion type on the solubility of C 2 H 6 in ILs is as follows: Cation effect. In previous studies, the influence of cation type has been investigated 43   www.nature.com/scientificreports/ effect of a number of cations on the solubility of gaseous hydrocarbons in ILs has been studied. Figure 16 examines a number of systems studied in this work. According to the results obtained for the proposed systems, the effect of cation type on the solubility of methane in ILs is as follows: Temperature and alkyl chain length effect. Many researchers have studied the effect of temperature 86 and alkyl chain length 87 on the solubility of gases in ILs. The results show that increasing the temperature reduces the solubility 86 . Generally, the dependency of solubility on temperature is thermodynamically related to the enthalpy of dissolution. For example, in exothermic processes (such as dissolution of CO 2 in ILs) the solubility of the gas decreases with increasing temperature, and in endothermic processes (such as dissolution of N 2 in ILs) the increase in temperature increases the solubility 83,84 . Also, increasing the alkyl chain length increases the solubility of hydrocarbons in ILs 87 . The reason for this subject is the increased entropy of solvation for ILs 84 . According to Fig. 17, although increasing the alkyl chain length does not show a specific trend in ethane solubility in the proposed systems, increasing the temperature has reduced the ethane solubility in the proposed systems.
Effect of the type of hydrocarbon. In order to investigate the effect of hydrocarbon type on the solubility of gaseous hydrocarbons, four different systems have been investigated. Figure 18 shows the trend of solubility changes for these four systems with increasing pressure at 320 K.    4 , and the solubility of methane in ionic liquids corresponded to the enthalpy of adsorption. Also, its solubility in ionic liquids with longer non-polar alkyl chains is higher. Thus, entropic effects and solvent-solvent interactions prevailed for CO 2 and CH 4 , respectively. The solubility of hydrocarbons in ILs is important in many ways. The selection of ILs as solvents in reactions involving permanent gas and gas storage media is strongly influenced by the degree of gas solubility in ILs and is essential for the development of separation methods. Investigation of the effect of various parameters such as temperature, pressure, etc. is necessary to deal with the separation of CO 2 from natural gas as well as the low adsorption of light hydrocarbons in ILs such as methane, ethane and propane. Therefore, evaluating the effect of temperature, pressure, anion and cation type, hydrocarbon type, and alkyl chain length can help researchers to select the appropriate ILs in different operating conditions. The findings of the study also show that ILs are suitable candidates for the solubility of hydrocarbons so they can be used in various operational applications.

Conclusions
In this work, the solubility of gaseous hydrocarbons in ILs was predicted using DBN, ELM, Boost-SVR, and MARS models. In this regard, the thermodynamic properties and chemical substructures of gaseous hydrocarbons and ILs along with temperature and pressure have been used in distinct methods to predict the solubility of gaseous hydrocarbons in ILs. The findings of this study give the following conclusions: • In the method based on thermodynamic properties, the DBN model with RMSE and R 2 values of 0.0054 and 0.9961, respectively, has the most accurate estimates. • In the method based on chemical structure, temperature, and pressure, the DBN model with RMSE and R 2 values of 0.0065 and 0.9943, respectively, has the best performance. • Comparison of the results of smart models with equations of state shows that the accuracy of the former is much higher than the latter. • The study of the effect of different factors on the solubility of gaseous hydrocarbons shows that increasing the pressure and molecular weight of hydrocarbons increases the solubility, while increasing the temperature decreases the solubility of gaseous hydrocarbons in ILs. • Investigation of the effect of the type of anion and cation in the structure of ILs shows that the following order have been established for the solubility of the studied systems:     www.nature.com/scientificreports/

Data availability
All the data have been collected from literature. We cited all the references of the data in the manuscript. However, the data will be available from the corresponding author on reasonable request.